Course Logistics

Course materials

Teachers

Qixiang

Mees
Daniel

Pablo

Program

Time           Monday        Tuesday       Wednesday     Thursday
9:00 – 10:30   Lecture 1     Lecture 3     Lecture 5     Lecture 7
               Break         Break         Break         Break
10:45 – 11:45  Practical 1   Practical 3   Practical 5   Practical 7
11:45 – 12:30  Discussion 1  Discussion 3  Discussion 5  Discussion 7
               Lunch         Lunch         Lunch         Lunch
13:45 – 15:15  Lecture 2     Lecture 4     Lecture 6     Lecture 8
               Break         Break         Break         Break
15:30 – 16:30  Practical 2   Practical 4   Practical 6   Practical 8
16:30 – 17:00  Discussion 2  Discussion 4  Discussion 6  Discussion 8

Goal of the course

  • Text data is everywhere!
  • A lot of the world’s data is in unstructured text format
  • The course teaches
    • text mining techniques
    • using R
    • on a variety of applications
    • in many domains.

What is Text Mining?

Text mining in an example

  • This is Garry!

  • Garry works at Bol.com (a webshop in the Netherlands)

  • He works in the Customer Relationship Management department.

  • He uses Excel to read and search customers’ reviews, extract aspects they wrote their reviews on, and identify their sentiments.

  • Curious about his job? See two examples!

This is a nice book for both young and old. It gives beautiful life lessons in a fun way. Definitely worth the money!

+ Educational

+ Funny

+ Price


Nice story for older children.

+ Funny

- Readability

Example

  • Garry likes his job a lot, but sometimes it is frustrating!

  • This is mainly because their company is expanding quickly!

  • Garry decides to hire Larry as his assistant.

Example

  • Still, a lot to do for two people!

  • Garry has some budget left to hire another assistant for a couple of years!

  • He decides to hire Harry too!

  • Still, manual labeling using Excel is labor-intensive!

Challenges?

  • What are the challenges they encounter in working with text?

Language is hard!

  • Different things can mean more or less the same (“data science” vs. “statistics”)
  • Context dependency (“You have very nice shoes”)
  • Same words with different meanings (“to sanction”, “bank”)
  • Lexical ambiguity (“we saw her duck”)
  • Irony, sarcasm (“That’s just what I needed today!”, “Great!”, “Well, what a surprise.”)
  • Figurative language (“He has a heart of stone”)
  • Negation (“not good” vs. “good”), spelling variations, jargon, abbreviations
  • All of the above differ across languages, yet 99% of the work is on English!

Text Mining to the Rescue!

Text mining

  • “the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources” Hearst (1999)

  • Text mining is about looking for patterns in text, much as data mining can be loosely described as looking for patterns in data.

  • Text mining describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources. (Wikipedia)

Language is hard!

  • We won’t solve linguistics …
  • In spite of the problems, text mining can be quite effective!

Examples & Applications

Text mining applications

Who wrote the Wilhelmus?

Text Classification

Which ICD-10 codes should I give this doctor’s note?

Sentiment Analysis / Opinion Mining

Statistical Machine Translation

Dialog Systems

Question Answering

Go beyond search

Which studies go in my systematic review?

And more …

  • Automatically distinguish political news from sports news

  • Authorship identification

  • Age/gender identification

  • Language Identification

Process & Tasks

Text mining process

Pattern discovery tasks in text

  • Text classification
  • Text clustering
  • Sentiment analysis
  • Feature selection
  • Topic modelling
  • Responsible text mining
  • Text summarization

And more in NLP

10-minute break


Regular Expressions

Regular expressions

In computing, a regular expression, also referred to as “regex” or “regexp”, provides a concise and flexible means for matching strings of text, such as particular characters, words, or patterns of characters. A regular expression is written in a formal language that can be interpreted by a regular expression processor.





http://en.wikipedia.org/wiki/Regular_expression

Regular expressions

  • A formal language for specifying text strings

  • How can we search for any of these?

    • netherland

    • netherlands

    • Netherland

    • Netherlands
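
One way to cover all four spellings with a single pattern is sketched below in R. The vector x and the extra string "Neverland" are made up for illustration; the constructs used here ([], ?, ^ and $) are the ones introduced on the next slides.

```r
# Illustrative input: the four target spellings plus one string that should not match.
x <- c("netherland", "netherlands", "Netherland", "Netherlands", "Neverland")

# [nN] accepts either case for the first letter, s? makes the final "s" optional,
# and ^ ... $ require the whole string to match.
pattern <- "^[nN]etherlands?$"

hits <- grepl(pattern, x)
print(hits)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE
```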

Some simple regex searches

Disjunction

The use of the brackets [] to specify a disjunction of characters:

The pipe symbol | is also for disjunction:
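
Both forms of disjunction can be tried out in R with grepl(). This is a small sketch; the example words are made up for illustration.

```r
x <- c("woodchuck", "Woodchuck", "groundhog", "squirrel")

# [wW] matches either "w" or "W" in that single position.
bracket_hits <- grepl("[wW]oodchuck", x)
print(bracket_hits)
## [1]  TRUE  TRUE FALSE FALSE

# The pipe chooses between whole alternatives (case-sensitive here).
pipe_hits <- grepl("groundhog|woodchuck", x)
print(pipe_hits)
## [1]  TRUE FALSE  TRUE FALSE
```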

Brackets and dash

The use of the brackets [] plus the dash - to specify a range:
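
As a minimal sketch of ranges in R, the following extracts all uppercase letters and all digits from a made-up string. Note that the R documentation warns that ranges can depend on the locale; the output shown assumes the usual behaviour for plain ASCII letters.

```r
x <- "Chapter 7: Down the Rabbit Hole"

# [A-Z] matches any single uppercase ASCII letter.
upper <- regmatches(x, gregexpr("[A-Z]", x))[[1]]
print(upper)
## [1] "C" "D" "R" "H"

# [0-9] matches any single digit.
digits <- regmatches(x, gregexpr("[0-9]", x))[[1]]
print(digits)
## [1] "7"
```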

Negation

The caret ^ negates a set when it is the first character inside the brackets []; anywhere else inside the brackets it just means a literal ^:
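
The two roles of the caret inside brackets can be checked with a short R sketch (the example strings are made up):

```r
# Caret first inside the brackets: "anything that is NOT a digit".
# Removing all such characters leaves only the digits.
digits_only <- gsub("[^0-9]", "", "Room 42, floor 3")
print(digits_only)
## [1] "423"

# Caret not first inside the brackets: a literal "^".
# [e^] matches either the letter "e" or the character "^".
caret_hits <- grepl("[e^]", c("a^b", "abc", "net"))
print(caret_hits)
## [1]  TRUE FALSE  TRUE
```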

Question and period marks

The question mark ? marks optionality of the previous expression:

The use of the period . to specify any character:
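
A small R sketch of both operators, using made-up word lists. The anchors ^ and $ are only there to force a match on the whole string.

```r
# u? means "zero or one u", so both spellings are accepted.
x <- c("color", "colour", "colouur")
optional_hits <- grepl("^colou?r$", x)
print(optional_hits)
## [1]  TRUE  TRUE FALSE

# The period stands for exactly one character, whatever it is.
y <- c("begin", "begun", "began", "beg3n", "bgn")
period_hits <- grepl("^beg.n$", y)
print(period_hits)
## [1]  TRUE  TRUE  TRUE  TRUE FALSE
```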

Anchors
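
As a minimal sketch of the usual anchors: ^ matches at the start of a string, $ at the end, and \b at a word boundary. The example strings below are made up for illustration.

```r
x <- c("The end", "At the end", "The", "Then")

# ^ anchors the match to the start of the string.
starts_with_the <- grepl("^The", x)
print(starts_with_the)
## [1]  TRUE FALSE  TRUE  TRUE

# $ anchors the match to the end of the string.
ends_with_end <- grepl("end$", x)
print(ends_with_end)
## [1]  TRUE  TRUE FALSE FALSE

# \\b requires a word boundary, so "Then" no longer matches.
whole_word_the <- grepl("\\bThe\\b", x)
print(whole_word_the)
## [1]  TRUE FALSE  TRUE FALSE
```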

Common sets

Aliases for common sets of characters:

The backslash for escaping!
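
The sketch below tries three widely used aliases in R: \d (digit), \w (letter, digit or underscore) and \s (whitespace). Because the backslash is also the escape character of R strings, it has to be doubled when the pattern is typed in R code. The example string is made up.

```r
x <- "Room_12 costs 30 euros!"

# \\d+ : one or more digits
numbers <- regmatches(x, gregexpr("\\d+", x))[[1]]
print(numbers)
## [1] "12" "30"

# \\w+ : one or more word characters (letters, digits, underscore)
words <- regmatches(x, gregexpr("\\w+", x))[[1]]
print(words)
## [1] "Room_12" "costs"   "30"      "euros"

# \\s+ : one or more whitespace characters, used here as a separator
pieces <- strsplit(x, "\\s+")[[1]]
print(pieces)
## [1] "Room_12" "costs"   "30"      "euros!"
```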

Operators for counting

  • Patterns are greedy: with counting operators such as * and +, regular expressions always match the largest string they can, expanding to cover as much of a string as they can.

  • To enforce non-greedy matching, use another meaning of the ? qualifier.

    • The operator *? is a Kleene star that matches as little text as possible.
    • The operator +? is a Kleene plus that matches as little text as possible.
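
The difference between greedy and non-greedy matching is easy to see on a made-up string with tags. The sketch uses perl = TRUE, which is one of the engines R offers and supports the +? operator.

```r
x <- "<b>bold</b> and <i>italic</i>"

# Greedy: .+ runs from the first "<" to the LAST ">" it can reach.
greedy <- regmatches(x, regexpr("<.+>", x, perl = TRUE))
print(greedy)
## [1] "<b>bold</b> and <i>italic</i>"

# Non-greedy: .+? stops at the FIRST ">" that completes a match.
lazy <- regmatches(x, regexpr("<.+?>", x, perl = TRUE))
print(lazy)
## [1] "<b>"

# With gregexpr() the non-greedy pattern finds every tag separately.
all_tags <- regmatches(x, gregexpr("<.+?>", x, perl = TRUE))[[1]]
print(all_tags)
## [1] "<b>"  "</b>" "<i>"  "</i>"
```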

Other

Some characters that need to be backslashed:
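
Two characters that commonly need a backslash are the period and the dollar sign, since both have a special meaning in a pattern. A short R sketch with made-up strings (remember that the backslash is doubled inside an R string):

```r
x <- c("Dr. Who", "cost: $5", "3.14", "3514")

# Unescaped, the period matches ANY character, so "3514" is matched too.
any_char <- grepl("3.14", x)
print(any_char)
## [1] FALSE FALSE  TRUE  TRUE

# Escaped, the period only matches a literal ".".
literal_dot <- grepl("3\\.14", x)
print(literal_dot)
## [1] FALSE FALSE  TRUE FALSE

# Unescaped, $ is the end-of-string anchor; escaped, it is a literal "$".
literal_dollar <- grepl("\\$5", x)
print(literal_dollar)
## [1] FALSE  TRUE FALSE FALSE

# fixed = TRUE switches regular expressions off and matches the text as-is.
fixed_dot <- grepl("3.14", x, fixed = TRUE)
print(fixed_dot)
## [1] FALSE FALSE  TRUE FALSE
```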

Operator precedence hierarchy
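
A common statement of the hierarchy, from highest to lowest precedence, is: parentheses (), then counters (* + ? {}), then sequences and anchors, then disjunction |. The R sketch below, with made-up strings, shows two consequences of that ordering.

```r
x <- c("the", "thee", "thethe", "any", "then", "many")

# Counters bind tighter than sequences: the* means "th" followed by zero or more "e".
star_on_letter <- grepl("^the*$", x)
print(star_on_letter)
## [1]  TRUE  TRUE FALSE FALSE FALSE FALSE

# Parentheses bind tightest: (the)* repeats the whole group.
star_on_group <- grepl("^(the)*$", x)
print(star_on_group)
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE

# Sequences and anchors bind tighter than |, so this reads as
# "starts with the" OR "ends with any".
loose_alternation <- grepl("^the|any$", x)
print(loose_alternation)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE

# Parentheses restrict the disjunction to the two whole words.
grouped_alternation <- grepl("^(the|any)$", x)
print(grouped_alternation)
## [1]  TRUE FALSE FALSE  TRUE FALSE FALSE
```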

Understanding Regular Expressions

  • Very powerful and quite cryptic

  • Fun once you understand them

  • Regular expressions are a small language of their own: programming with characters

  • It is kind of an “old school” language

In R

The primary R functions for dealing with regular expressions are:

  • grep(), grepl(): Search for matches of a regular expression/pattern in a character vector

  • regexpr(), gregexpr(): Search a character vector for regular expression matches and return the indices where the match begins; useful in conjunction with regmatches()

  • sub(), gsub(): Search a character vector for regular expression matches and replace that match with another string

  • The stringr package provides a series of functions implementing much of the regular expression functionality in R but with a more consistent and rationalized interface.
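
The base R functions listed above can be compared side by side on a few made-up review snippets. This sketch uses base R only; the stringr equivalents are left out so that no extra package is needed.

```r
reviews <- c("Nice book, nice price", "Too expensive", "A nice story")

# grep(): indices of the elements that contain a match
idx <- grep("nice", reviews)
print(idx)
## [1] 1 3

# grepl(): one TRUE/FALSE per element
has_nice <- grepl("nice", reviews)
print(has_nice)
## [1]  TRUE FALSE  TRUE

# regexpr(): start position of the first match in each element (-1 = no match)
first_pos <- as.integer(regexpr("nice", reviews))
print(first_pos)
## [1] 12 -1  3

# sub() replaces only the first match per element, gsub() replaces all of them
first_replaced <- sub("[nN]ice", "great", reviews)
all_replaced   <- gsub("[nN]ice", "great", reviews)
print(first_replaced)
## [1] "great book, nice price" "Too expensive"          "A great story"
print(all_replaced)
## [1] "great book, great price" "Too expensive"           "A great story"
```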

Example

  • Find all instances of the word “the” in a text.
the

Misses capitalized examples



[tT]he

Incorrectly returns words such as other or Netherlands



[^a-zA-Z][tT]he[^a-zA-Z]

Still not completely correct! What is missing?

Example

txt <- "The other the Netherlands will then be without the"
r   <- gregexpr("[^a-zA-Z][tT]he[^a-zA-Z]", txt)
print(regmatches(txt, r))
## [[1]]
## [1] " the "
r   <- gregexpr("(^|[^a-zA-Z])[tT]he($|[^a-zA-Z])", txt)
print(regmatches(txt, r))
## [[1]]
## [1] "The "  " the " " the"

Errors

  • The process we just went through was based on fixing two kinds of errors

    • Matching strings that we should not have matched (there, Netherlands)

      • False positives (Type I)
    • Not matching things that we should have matched (The)

      • False negatives (Type II)

Errors cont.

  • In NLP we are always dealing with these kinds of errors.

  • Reducing the error rate for an application often involves:

    • Increasing precision (minimizing false positives)

    • Increasing recall (minimizing false negatives).
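
Precision and recall can be computed for the “the” patterns from the earlier example. The sketch below is illustrative only: the helper precision_recall() and the hand-labelled vector gold (start positions of the three standalone occurrences of “the” in the example text) are assumptions introduced here, not part of the course material.

```r
txt  <- "The other the Netherlands will then be without the"
gold <- c(1L, 11L, 48L)  # hand-labelled start positions of the word "the"

# Hypothetical helper: compares the start positions found by a pattern
# with the hand-labelled positions.
precision_recall <- function(pattern, text, gold) {
  found <- as.integer(gregexpr(pattern, text)[[1]])
  found <- found[found > 0]            # gregexpr() returns -1 when nothing matches
  tp <- length(intersect(found, gold)) # true positives
  c(precision = tp / length(found), recall = tp / length(gold))
}

pr_plain   <- precision_recall("the", txt, gold)           # precision 2/5, recall 2/3
pr_bracket <- precision_recall("[tT]he", txt, gold)        # precision 3/6, recall 3/3
pr_bounded <- precision_recall("\\b[tT]he\\b", txt, gold)  # precision 3/3, recall 3/3
```

Going from the first pattern to the second raises recall (the capitalized “The” is now found); adding word boundaries then raises precision by dropping “other”, “Netherlands” and “then”.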

Question

Given the text “this Summer School is Utrecht summer school”, which RE finds all the “summer school” mentions?

  • (^|[^a-zA-Z])[sS]ummer *[sS]chool([^a-zA-Z])
  • (^|[^a-zA-Z])[sS]ummer\s[sS]chool($|[^a-zA-Z])
  • (^|[^a-zA-Z])[sS]ummer *[sS]chool($|[^a-zA-Z])
  • [^a-zA-Z][sS]ummer\s[sS]chool($|[^a-zA-Z])

Question

Solution

txt <- "this Summer   School is utrecht summerschool"
r <- gregexpr("(^|[^a-zA-Z])[sS]ummer *[sS]chool($|[^a-zA-Z])", txt)
#r <- gregexpr("(^|[^a-zA-Z])[sS]ummer\\s*[sS]chool($|[^a-zA-Z])", txt)

print(regmatches(txt, r))
## [[1]]
## [1] " Summer   School " " summerschool"

Question

Suppose we want to build an application to help a user buy a car from textual catalogues. The user looks for any car cheaper than $10,000.00.

Assume we are using the following data: txt <- c("Price of Tesla S is $8599.99.", "Audi Q4 is $7000.", "BMW X5 costs $900")

Which RE will help us to do this?

  • (^|\W)\$[0-9]{0,4}(\.[0-9][0-9])*
  • (^|\W)\$[0-9]{0,3}(\.[0-9][0-9])+
  • (^|\W)\$[0-9]{0,4}(\.[0-9][0-9])?
  • (^|\W)\$[0-9][0-9][0-9][0-9](\.[0-9][0-9])*

Question

Solution

txt <- c("Price of Tesla S is $8599.99.", 
         "Audi Q4 is $7000.", 
         "BMW X5 costs $900") 
r <- gregexpr("(^|\\W)\\$[0-9]{0,4}(\\.[0-9][0-9])?", txt)
print(regmatches(txt, r))
## [[1]]
## [1] " $8599.99"
## 
## [[2]]
## [1] " $7000"
## 
## [[3]]
## [1] " $900"

Summary

Summary

  • Text data is everywhere!
  • Language is hard!
  • Sophisticated sequences of regular expressions are often the first model for any text processing tool
  • Regular expressions are a cryptic but powerful language for matching strings and extracting elements from those strings
  • The basic problem of text mining is that text is not a neat data set
  • One solution: text pre-processing

Next: Text preprocessing

  • is an approach for cleaning and noise removal of text data.
  • brings your text into a form that is analyzable for your task.
  • transforms text into a more digestible form so that machine learning algorithms can perform better.
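
As a preview, a few typical cleaning steps can already be written with the regular expression functions from this lecture. This is only a rough sketch in base R on a made-up review; which steps are appropriate depends on the task, and the next lecture may use different tools.

```r
review <- "This is a NICE book for both young and old!!  Definitely worth the money."

clean <- tolower(review)              # normalize case
clean <- gsub("[^a-z ]", "", clean)   # drop everything except lowercase letters and spaces
clean <- gsub(" +", " ", clean)       # collapse repeated spaces
clean <- trimws(clean)                # remove leading and trailing whitespace
print(clean)
## [1] "this is a nice book for both young and old definitely worth the money"

# Split the cleaned text into word tokens.
tokens <- strsplit(clean, " ")[[1]]
print(length(tokens))
## [1] 14
```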

Practical 1

Are you curious about the end of the Example?

  • During one of the coffee moments at the company, Garry was talking about the situation in the Customer Relationship Management department.

  • When Carrie, his colleague from the Data Science department, hears about the situation, she suggests that Garry try text mining!

  • She says: “Text mining is your friend; it can make the process much faster than Excel by filtering words and recommending labels.”

  • She continues: “Text mining is a subfield of AI and NLP and is related to data science, data mining and machine learning.”

  • After consulting with Larry and Harry, they decide to give text mining a try!

End